Abstract

Background: Pre-symptomatic prediction of disease and drug response based on genetic testing is a critical 
component of personalized medicine. Previous work has demonstrated that the predictive capacity of genetic 
testing is constrained by the heritability and prevalence of the tested trait, although these constraints have only 
been approximated under the assumption of a normally distributed genetic risk distribution. 

Results: Here, we mathematically derive the absolute limits that these factors impose on test accuracy in the 
absence of any distributional assumptions on risk. We present these limits in terms of the best-case receiver- 
operating characteristic (ROC) curve, consisting of the best-case test sensitivities and specificities, and the AUC (area 
under the curve) measure of accuracy. We apply our method to genetic prediction of type 2 diabetes and breast 
cancer, and we additionally show the best possible accuracy that can be obtained from integrated predictors, 
which can incorporate non-genetic features. 

Conclusion: Knowledge of such limits is valuable in understanding the implications of genetic testing even before 
additional associations are identified. 



Background 

Accurate pre-symptomatic prediction of disease and 
drug response is a vital component of personalized 
medicine, which could allow for improved clinical 
decision-making and targeted prevention strategies, eas- 
ing both the burden and costs of disease [1]. Already, 
several companies offer consumers personalized risk 
assessments, lifestyle recommendations, and nutraceuti-
cals based on their genetic profiles [2]. Unfortunately,
most genetic factors associated with common traits ex- 
plain only a small portion of the phenotypic variance 
(the "missing heritability" problem [3]), making genetic 
prediction currently difficult [4]. Investment into studies 
that assay rare variants [5] and the use of informative 
polymorphisms that do not individually pass stringent 
statistical tests of association [6] can improve the accur- 
acy of predictions, but the extent to which predictions 
can be improved is unclear. Thus, identifying the bounds 
on the accuracy of predictive genetic testing based on 
readily-known disease parameters (such as prevalence
and heritability) can be an invaluable planning tool.

Although the accuracy of a medical test can be mea- 
sured in many ways, the concepts of sensitivity and spe- 
cificity are paramount [7]. Frequently, the test result is 
continuous (e.g. the individual's predicted risk), while
the clinical decision and true outcome are binary (e.g. ei- 
ther the person will get sick or not), so that different 
thresholds of the test result yield different pairs of sensi-
tivity and specificity. The receiver operating characteristic
(ROC) curve depicts this tradeoff between sensitivity 
and specificity across all possible thresholds, and the 
area under this curve (AUC) is the most widely used 
metric to summarize the accuracy of a test. An AUC of 
1 indicates perfect prediction while an AUC of 0.5 repre- 
sents random guessing. 
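
The relationship between thresholds, sensitivity/specificity pairs, and the AUC can be illustrated with a minimal sketch. The function names and the six example scores are hypothetical and are not taken from any study cited here; the sketch simply sweeps every distinct threshold of a continuous test result and integrates the resulting ROC curve with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) pairs, one per distinct threshold.

    scores: continuous test results (e.g. predicted risks).
    labels: true binary outcomes, 1 = affected, 0 = unaffected.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nobody tests positive
    for threshold in sorted(set(scores), reverse=True):
        true_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        false_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((false_pos / negatives, true_pos / positives))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical toy data: six people, three of whom are affected.
example_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
example_labels = [1, 1, 0, 1, 0, 0]
example_auc = auc_trapezoid(roc_points(example_scores, example_labels))
```

In this toy data set, eight of the nine affected/unaffected pairs are ranked correctly, so the AUC is 8/9.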

Evidence that a bound on maximum predictive accur- 
acy exists can be found in heritability. The heritability of 
a trait (in the broad-sense) is the proportion of pheno- 
typic variation in the population that can be attributed to 
genetic variation; that is, it reflects the contribution of 
genetic factors relative to environmental ones. Narrow- 
sense heritability measures the corresponding quantity 
for additive genetic variance only, which excludes genetic 
effects such as dominance and epistasis. The heritability 
of binary phenotypes can be computed directly on the
observed binary scale. However, it may also be calculated 
on a liability scale, where it is assumed that an individual 
has the binary trait if their risk exceeds a threshold. Both 
types of heritability can be estimated using family-based 
studies, such as twin studies [8], and the two scales can 
be mapped to each other [9].

The impact of heritability on genetic test accuracy can 
be seen by examining its two extremes: a trait that has 
100% heritability, such as a Mendelian trait, can be pre- 
dicted with certainty from the genotype; in contrast, a 
trait with 0% heritability is not influenced by genetic fac- 
tors, and thus genetic tests cannot produce any useful 
information. Previous ground-breaking works have 
investigated the bounds prevalence and heritability im- 
pose on predictive accuracy using simulations [10], ana- 
lytical results utilizing genotype relative risks and their 
frequencies [11], and analytical approximations under 
the assumption of a normally distributed liability [12,13]. 
Here, we mathematically elucidate the absolute bounds 
on the specificities, sensitivities, and AUC for genetic 
testing given any values of heritability and prevalence of 
the tested trait, without making any assumptions about 
the risk distribution. 

Results 

Common complex traits are typically the combined ef- 
fect of genetic and environmental factors. Since no prac- 
tical predictor can account for all factors and their 
interactions, clinical prediction can at best assign prob- 
abilistic risks rather than deterministic outcomes. 
Viewed on the population level, these risk assignments 
can be seen as comprising a risk distribution, which is 
an estimate of the population's true risk distribution.
Maximal predictive accuracy occurs when the estimated 
risk matches the true risk. 

The prevalence and heritability of any trait restrict the 
set of possible genetic risk distributions. If we know the 
risk corresponding to each individual's genetic profile in
a large sample, then we can obtain an expression for 
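broad-sense heritability ($H^2$) on the binary scale [10]:

$$\text{heritability} = H^2 = 1 - \frac{\sum_{i=1}^{n} \mathrm{risk}_i \,(1 - \mathrm{risk}_i)}{\overline{\mathrm{risk}}\,(1 - \overline{\mathrm{risk}})\, n} \qquad (1)$$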



where i = 1, ..., n indexes people, n is the sample size,
$\mathrm{risk}_i$ is individual i's genetic risk (i.e. the conditional
probability of the trait given genes), and $\overline{\mathrm{risk}}$ is the aver-
age genetic risk, which equals the average population
risk (see Methods). The meaning of risk depends on the
context: for instance, when the phenotype is current dis- 
ease status, the average risk in the population is its 
prevalence, whereas in prediction of lifetime illness, risk 
is the lifetime risk of disease. (When possible, we
nonetheless opt for the term prevalence.) Equation 1
mathematically expresses that heritability is the propor- 
tion of phenotypic variance explained by the genetic risk 
distribution. 
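
As a rough illustration of Equation 1, the sketch below computes binary-scale heritability from a list of per-person genetic risks. It assumes the equation is read as one minus the ratio of the average residual variance, risk(1 - risk), to the phenotypic variance, which is algebraically the same as the variance of genetic risk divided by the phenotypic variance. Both forms are coded so that the equivalence can be checked. The function names and the example risks are hypothetical.

```python
def binary_scale_heritability(risks):
    """Heritability on the observed binary scale from per-person genetic risks.

    risks: one probability in [0, 1] per person (the conditional
    probability of the trait given that person's genes).
    """
    n = len(risks)
    mean_risk = sum(risks) / n  # average risk, e.g. prevalence
    phenotypic_variance = mean_risk * (1 - mean_risk)
    if phenotypic_variance == 0:
        raise ValueError("average risk must lie strictly between 0 and 1")
    # Variance that remains once each person's genetic risk is known.
    residual_variance = sum(r * (1 - r) for r in risks) / n
    return 1 - residual_variance / phenotypic_variance


def risk_variance_ratio(risks):
    """Equivalent form: variance of genetic risk over phenotypic variance."""
    n = len(risks)
    mean_risk = sum(risks) / n
    genetic_variance = sum((r - mean_risk) ** 2 for r in risks) / n
    return genetic_variance / (mean_risk * (1 - mean_risk))


# Hypothetical population: half at 10% risk, half at 50% risk.
example_risks = [0.1] * 5 + [0.5] * 5
example_h2 = binary_scale_heritability(example_risks)
```

The two extremes behave as expected: risks that are all 0 or 1 (a fully determined trait) give a heritability of 1, and identical risks for everyone give 0.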

To mathematically derive the risk distribution that 
yields the best genetic prediction, we model the distribu- 
tion as a histogram with equally-spaced bins located 
from 0 to 100% representing risk groups, where the 
height of each bin denotes the proportion of the popula- 
tion who fall into that risk group (for an example, see 
Figure 1). This approach can define any risk distribution. 
Indeed, multiple genetic risk distributions can correspond 
to a given combination of prevalence and heritability; each 
such distribution, however, can lend itself differently to 
genetic prediction. Our method is based precisely on de- 
termining which such distribution (for a given prevalence 
and heritability) would allow the best predictive accuracy. 
Thus, for each combination of prevalence and heritability, 
we optimized the AUC that would be achieved if every- 
one's risk were ideally ordered over the set of risk distribu- 
tions that satisfied the combination of prevalence and 
heritability; similarly, we maximized the sensitivity for any 
given specificity, prevalence, and heritability over the set 
of risk distributions and thresholds that satisfied the 
constraints. 

Figure 1 Example risk distribution. This distribution has a
prevalence of 30% and a heritability of 10%. The mean of the
distribution equals the prevalence of the trait. Variance represents
the variance of risk due to genetic variation, sometimes called
genetic variance, and is proportional to heritability.

Using this approach, we have derived the maximum
limits on the genetic predictive accuracy of any binary
trait given only its prevalence and heritability. These
values are tabulated in Additional files 1 and 2 in terms
of the AUC and sensitivity/specificity pairs, respectively.
Additional file 3 contains computer code in the R soft- 
ware environment [14] for the algorithms we developed. 
Figure 2 displays AUC limits over all heritabilities for 
several prevalences, and it includes a comparison with 
the limits that would exist if genetic risk followed a beta 
distribution. The beta distribution is a flexible statistical 
distribution which is consistent with the assumptions of 
previous analytical approximations of the effect of preva- 
lence and heritability on the ROC curve [12,13], because 
it can take the shape of countless smooth unimodal risk 
distributions. Furthermore, unlike previous approxima- 
tions which deteriorate at high heritabilities [12], the 
beta distribution limits do not. The limits that the beta 
distribution imposes on the AUC closely track these pre- 
vious approximations [12,13] and also match a predictive 
genomics simulation based on a multiplicative genetic 
model [10]. 

Knowledge of this maximal limit on accuracy is benefi- 
cial in the case of type 2 diabetes (T2D), where early tar- 
geted intervention can be costly but effective [15]. Many 
prediction studies of T2D have been reported, yet the 
genetic contribution to their predictive accuracy has 
been disappointing: genes alone yield ~60% AUC, and
adding genes to clinical risk factors yields incremental
improvements of ~1-2% AUC [16,17]. The heritability of
T2D per se (as opposed to related continuous traits with 
higher heritability, e.g. body mass index) was estimated 
to be 26% by a population-based twin study [18], with a
prevalence of 13%. Applying our method to these statis-
tics determines the maximum sensitivity/specificity pairs
displayed in Figure 3, which show that, for example, if a
specificity of 99% is desired, sensitivity cannot exceed
36%, and that if a sensitivity of 99% is desired, specificity
cannot exceed 74%. Similarly, they determine the max-
imum achievable AUC for genetic prediction of lifetime
T2D to be 89%. This motivates the search for additional
genetic factors influencing risk for T2D.

Figure 2 Heritability vs. predictive accuracy. Relationship of
the proportion of variance explained to the maximal upper limit on AUC. The
numbers next to the curves represent the prevalence. The maximal
AUCs are compared with those that would exist if the genetic risk
distribution followed a beta distribution, which is consistent with
previous reports [10,12,13].

Breast cancer has the same maximal AUC as T2D, al- 
beit with a distinct ROC curve from T2D. Breast cancer 
was found to have a prevalence of 4% [19], and we cal- 
culated its heritability on the binary scale to be 11% (see 
Methods), which yields a maximum AUC of 89%. Al- 
though this is the same maximum AUC as for T2D, the 
sensitivity/specificity pairs for breast cancer (Figure 3) 
are not identical to those for T2D, owing to the different 
disease parameters. For example, to reach a specificity of 
99%, sensitivity cannot exceed 24%, which is substan- 
tially lower than the corresponding maximal sensitivity 
of T2D when specificity is 99%. The divergence of these 
two ROC curves as specificity approaches 100% illus- 
trates the importance of identifying the maximal ROC 
curve, rather than relying on the maximal AUC alone. 

Heritability is the proportion of phenotypic variance 
explained by all genetic factors, but our analytic ap- 
proach can treat the proportion of phenotypic variance 
explained by any particular set of factors. If the propor- 
tion of phenotypic variance explained by a particular set 
of genes is known, that proportion of variance explained 
could be substituted for heritability in our model. For 
instance, if a subset of genes could explain 50% of the 
genetic variance of T2D (i.e. explain 13% of phenotypic 
variance), then the maximum achievable AUC of this 
subset would be 80%. 

Our method can also be applied in elucidating the 
maximum accuracy of predictors that integrate features 
such as gene expression, de novo mutation, body mass 
index, and lifestyle (which are not fully inherited). The 
proportion of variance explained by such an integrated 
predictor can then be greater than heritability. When 
there are no gene-environment interactions, this differ- 
ence is the proportion of phenotypic variation that these 
features explain independently of genes. For example, 
weekly physical activity can explain 4% of phenotypic 
variance of T2D (see Methods), is moderately heritable 
[20], and was found to not interact with well-known 
gene variants in T2D [21]. Accordingly, the proportion 
of variance explained by the integrated predictor com- 
prised of genomic profile and physical activity does not 
increment by the full 4% beyond the heritability of T2D. 
If the proportion of T2D variance that physical activity 
explains independently of genes was known to be only 
3%, say, then the integrated predictor's maximum AUC
would be calculated based on a proportion of variance
explained of 29% (sum of 26% and 3%), which yields a
maximum AUC of 90%. If, however, we did not have an
estimate for the proportion of T2D variance that physical
activity explains independently of genes, then we could
conservatively use 4% in the previous calculation, yielding
a similar AUC. This analysis applies to predictors based
on non-genetic features that are supplemented by genetics.
In general, the estimation of the proportion of variance
explained by integrated predictors is complicated by the
interaction of genetic and non-genetic features; our
method can nonetheless be applied when the interaction
can be estimated or bounded. Note that genetic testing
alone can still accurately predict outcome for some small,
extreme risk groups (such as those with highly penetrant
variants), but such a test will not benefit the general popu-
lation without both a high sensitivity and specificity [22].

Figure 3 ROC curves for type 2 diabetes and breast cancer from genomic profiles. Maximal sensitivity/(1 - specificity) pairs for prediction of
type 2 diabetes and breast cancer from full genomic profiles. The maximal pairs are compared to the pairs that would exist if the genetic risk
distribution followed a beta distribution, which is consistent with previous reports [10,12,13].



Discussion 

Our results are general and apply to any binary trait, 
and they rely on only two commonly estimated para- 
meters. Although the quality of the results is only as 
good as the estimates of prevalence and heritability for 
the population in question, our method allows for ranges 
of prevalences and heritabilities to be considered, which 
can provide important insight into predictive accuracies. 
Nonetheless, care must be taken when applying these 
statistics, as different estimates apply in different situa- 
tions. For example, in assessing limits to the prediction 
of lifelong risk, lifelong risk estimates should be used
in place of prevalence estimates. In particular, the bal-
looning lifelong risk of T2D in the USA [23] implies
genetic prediction of lifetime T2D will become more 
difficult. 

The method that we present here can also be used to 
determine the potential benefit of a future genome-wide
association study (GWAS) in improving predictive ac- 
curacy. To do so, we refer to estimates of GWAS pre- 
dictive power that were cleverly derived either by 
simulation studies [24] or closed-form considerations 
[25]. Both approaches measure the potential GWAS 
benefit in terms of the correlation of individuals' genetic 
risk as predicted by the GWAS to their true genetic risk. 
We can use our results to connect this measure to AUC 
and sensitivity/specificity pairs by converting this correl- 
ation to a proportion of phenotypic variance explained. 
If $H^2$ is the broad-sense heritability and r is the correl-
ation of true to estimated genetic risk, then the propor-
tion of phenotypic variance that the proposed GWAS
may explain, $R^2$, is given by [12]:

$$R^2 = r^2 H^2 \qquad (2)$$



Using this approach, one may evaluate a proposed 
GWAS based on parameters such as sample size and the 
number of loci sampled. 
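
Equation 2 amounts to a one-line conversion, sketched below. The function name and the example inputs (a correlation of 0.7 and a heritability of 26%) are hypothetical and chosen only to show the arithmetic; the resulting proportion of variance explained is the quantity that would then be looked up in place of heritability.

```python
def gwas_variance_explained(r, h2):
    """Proportion of phenotypic variance a GWAS-based predictor may explain.

    r:  correlation between estimated and true genetic risk (0 to 1).
    h2: broad-sense heritability on the binary scale (0 to 1).
    """
    if not 0 <= r <= 1:
        raise ValueError("r must be between 0 and 1")
    if not 0 <= h2 <= 1:
        raise ValueError("h2 must be between 0 and 1")
    return r ** 2 * h2


# Hypothetical inputs, for illustration only.
example_r2 = gwas_variance_explained(r=0.7, h2=0.26)
```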

Heritability estimates for any binary trait can be 
used by our method. Broad-sense heritability esti- 
mates are needed to cap predictive accuracy, since 
genetic predictors can exploit dominance and epistatic 
interactions not measured by narrow-sense heritability 
estimates. However, if a genetic predictor is constructed as
an additive model in line with the assumptions of narrow- 
sense heritability, then its maximum accuracy can be cal- 
culated using narrow-sense heritability; thus, these esti- 
mates can also be used, albeit with a slightly different 
interpretation. Heritability estimates on the normal liabil- 
ity scale can be used after they are transformed to the 
observed binary scale, e.g. using the method proposed by 
Dempster and Lerner [8,9]. Heritability on the binary scale 
can be sensitive to prevalence [26], but its use avoids the 
assumption of normally-distributed liability, which 
requires that the trait be affected by many genes, all with 
small effect (normally-distributed liability effectively 
requires a purely unimodal genetic risk distribution). In 
fact, when variants with particularly large effects do 
exist (such as APOE in Alzheimer's disease [27], BRCA1
and BRCA2 in breast and ovarian cancer [28], and LRRK2
in Parkinson's disease [29]), previous authors have sug-
gested simulations in lieu of their analytical approximation
[13]. Moreover, because liability cannot be measured, the 
distributional assumptions on liability are virtually untest- 
able [30]. 
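
The liability-to-observed-scale transformation mentioned above can be sketched as follows, assuming the commonly quoted form of the Dempster and Lerner relation: heritability on the observed scale equals heritability on the liability scale multiplied by z^2/(K(1 - K)), where K is the prevalence and z is the height of the standard normal density at the liability threshold. The function name is hypothetical, and the sketch inherits the normal-liability assumption discussed in this section.

```python
from statistics import NormalDist


def liability_to_observed_h2(h2_liability, prevalence):
    """Map liability-scale heritability to the observed binary scale.

    Assumes a standard normal liability with a single threshold chosen so
    that a fraction `prevalence` of the population exceeds it.
    """
    if not 0 < prevalence < 1:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    std_normal = NormalDist()  # mean 0, standard deviation 1
    threshold = std_normal.inv_cdf(1 - prevalence)
    z = std_normal.pdf(threshold)  # density height at the threshold
    return h2_liability * z ** 2 / (prevalence * (1 - prevalence))
```

For a prevalence of 50% the conversion factor is 2/pi (about 0.64), and it shrinks as the trait becomes rarer, which is one way to see why binary-scale heritability is sensitive to prevalence.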

Our maximal ROC curves (Figure 3) can be sub- 
stantially higher than those given by the beta distribu- 
tion, which is an accurate proxy for multiple previous 
reports [10,12,13], indicating that the maximal accur- 
acies of genetic prediction may be substantially higher 
than previously thought. This difference highlights the 
importance that the risk distribution can have in the 
power of genetic prediction. Furthermore, as we are 
only now beginning to uncover the risk distributions 
of common complex diseases, it seems important to 
understand the absolute, distribution-independent limits 
on genetic-test accuracy, which we present here. 

Conclusion 

We have given exact limits on genetic prediction for any 
binary trait imposed by the epidemiological parameters 
of prevalence and heritability. Knowledge of these limits 
can help delineate the maximal benefits associated with 
genetic testing, which can allow for cost-benefit analyses, 
regulation, and clinical guidelines regarding genetic test- 
ing even before additional associations are identified. 
We have also illustrated how these limits can help us 
prioritize the allocation of research resources, by showing 
how they can assist in the prioritization and design of fu- 
ture association studies. The calculations presented in this 
paper could further be used to mitigate the possibility of 
investing in the development of a genetic test which could 
never be accurate enough to be of clinical relevance. 

Methods 

To optimize over the set of risk distributions subject to 
the disease parameters of average risk and proportion of
variance explained (PVE), we modeled a categorical dis-
tribution (which resembles a histogram) with b + 1 bins
located at 0, 1/b, 2/b, . . . , 1 representing risk groups, so
i/b represents the conditional probability of disease
given a set of factors for individuals in risk group i (e.g.
people in the 1/b group have risk 1/b). An example of
such a distribution is depicted in Figure 1. The probability
that someone falls into bin i is $p_i$, where the $p_i$'s (for
i = 0, ..., b) sum to one. We restrict the average risk
(e.g. prevalence) and PVE (e.g. heritability) using two 
observations. (1) By the law of total probability, the un- 
conditional probability of disease is simply the mean of 
the conditional risk distribution, i.e. it is equal to the aver- 
age risk. (2) The PVE relates to the risk distribution 
through Equation 1. (Equation 1 can be understood as the 
$R^2$ from the regression: binary phenotype = risk + error,
where risk is a probability.) 

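Now, we perform a brief simplification of Equation 1.
Following Wray et al. [24], we denote average risk by k,
and for generality we work in terms of PVE instead of $H^2$:

$$
\begin{aligned}
\mathrm{PVE} &= 1 - \frac{\sum_{i=1}^{n} \mathrm{risk}_i \,(1 - \mathrm{risk}_i)}{k(1 - k)\,n} \\
k(1 - k)\,\mathrm{PVE} &= k(1 - k) - \frac{\sum_{i=1}^{n} \mathrm{risk}_i \,(1 - \mathrm{risk}_i)}{n} \\
k(1 - k)\,\mathrm{PVE} &= k(1 - k) - k + \frac{\sum_{i=1}^{n} \mathrm{risk}_i^2}{n} \\
k(1 - k)\,\mathrm{PVE} + k^2 &= \frac{\sum_{i=1}^{n} \mathrm{risk}_i^2}{n} \qquad (3)
\end{aligned}
$$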



where i = 1, ..., n indexes individuals, n is the sample size,
and $\mathrm{risk}_i$ is individual i's genetic risk. We can relate the
right-hand side of Equation 3 to risk groups as follows:

$$\sum_{i=1}^{n} \mathrm{risk}_i^2 / n = \sum_{j=0}^{b} n_j \,\mathrm{risk}_j^2 / n = \sum_{j=0}^{b} p_j \,(j/b)^2$$

Here, $n_j$ individuals have risk j/b, i.e. they are assigned
to risk group (histogram bin) j, and $p_j = n_j/n$ is the prob-
ability that a random individual is assigned to risk group j.

With this model of the risk distribution and con- 
straints, we can identify the best-case AUC and optimal 
sensitivity/specificity pairs using the procedures detailed 
below. Because these procedures associate a single gen- 
etic risk distribution with the best-case AUC and a po- 
tentially different risk distribution with each optimal 
sensitivity/specificity pair, it is possible that only some of 
these sensitivity/specificity pairs may be realizable for a 
single trait in practice. Consequently, these sensitivity/ 
specificity pairs cannot be used directly to derive the 
maximal AUC. 
Area under ROC curve

To model the AUC, we begin with the random variables
X and Y whose probability density functions represent
the risk distribution of those who will not and those
who will get sick, respectively. These densities can be
easily obtained through Bayes' rule:

$$P(X = i/b) = \frac{(1 - i/b)\,p_i}{1 - k} \quad \text{and} \quad P(Y = i/b) = \frac{(i/b)\,p_i}{k},$$

where k is the average risk. Then,
through its equality to the Mann-Whitney-Wilcoxon U
statistic [31], the AUC is equal to P(X < Y) + P(X =
Y)/2. We condition on Y to evaluate this expression:

$$\mathrm{AUC} = \sum_{i=0}^{b} P(Y = i/b) \left[ \sum_{j=0}^{i-1} P(X = j/b) + \frac{P(X = i/b)}{2} \right]$$
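
The calculation above can be sketched in a few lines. The sketch assumes that the risk distribution is supplied as a list p of b + 1 probabilities, with p[i] the share of the population in the risk group i/b, and it applies Bayes' rule and the Mann-Whitney-Wilcoxon identity directly. The function name is hypothetical and this is not the code from Additional file 3.

```python
def auc_from_risk_distribution(p):
    """AUC achieved when everyone is ranked by true risk.

    p: list of b + 1 probabilities summing to one; p[i] is the share of
       the population whose risk equals i / b.
    """
    b = len(p) - 1
    k = sum(i / b * p[i] for i in range(b + 1))  # average risk
    if not 0 < k < 1:
        raise ValueError("average risk must lie strictly between 0 and 1")
    # Bayes' rule: risk distribution among those who will get sick (Y)
    # and among those who will not (X).
    cases = [(i / b) * p[i] / k for i in range(b + 1)]
    controls = [(1 - i / b) * p[i] / (1 - k) for i in range(b + 1)]
    auc = 0.0
    controls_below = 0.0  # running value of P(X < i/b)
    for i in range(b + 1):
        auc += cases[i] * (controls_below + controls[i] / 2)
        controls_below += controls[i]
    return auc
```

A distribution concentrated on risks 0 and 1 gives an AUC of 1, and a distribution in which everyone shares the same risk gives 0.5, matching the two reference values of the AUC.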



We would like to optimize this term, but unfortunately
it is not convex, which would undermine our ability to
identify the global optimum. However, after we substi-
tute $p_0$ with $1 - \sum_{i=1}^{b} p_i$, our optimization of the AUC
becomes a convex optimization problem:

$$\mathrm{AUC} = \frac{\sum_{i=1}^{b} i\,p_i \left[ b - \sum_{j=1}^{b} b\,p_j + \sum_{j=1}^{i-1} (b - j)\,p_j + (b - i)\,p_i/2 \right]}{b^2 k(1 - k)}$$

The numerator of this expression can be conveniently
represented as $p^T Q p + b^2 k$, where Q is a symmetric
matrix whose entry at row i and column j is $-j(b + i)/2$
for i ≥ j.
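
The quadratic form can be checked numerically with a short sketch. It assumes that the stated entry of Q, -j(b + i)/2, also applies on the diagonal (i = j) and that Q is filled in symmetrically for i < j. The vector p again has b + 1 entries, of which only p[1], ..., p[b] enter the quadratic form because p[0] has been substituted out. The function name is hypothetical.

```python
def auc_quadratic_form(p):
    """AUC evaluated as (p^T Q p + b^2 k) / (b^2 k (1 - k)).

    p: list of b + 1 probabilities summing to one; p[0] is implied by the
       others and is not used directly.
    """
    b = len(p) - 1
    k = sum(i * p[i] for i in range(1, b + 1)) / b  # average risk
    quadratic = 0.0
    for i in range(1, b + 1):
        for j in range(1, b + 1):
            larger, smaller = max(i, j), min(i, j)
            q_entry = -smaller * (b + larger) / 2  # symmetric entry of Q
            quadratic += p[i] * q_entry * p[j]
    return (quadratic + b * b * k) / (b * b * k * (1 - k))
```

For the three-bin distribution (0.25, 0.5, 0.25), Q has entries -1.5, -2, -2, and -4, the quadratic term is -1.125, and the AUC is (-1.125 + 2)/1 = 0.875, the same value obtained by conditioning on Y directly.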

We then maximize this AUC over the vector p subject
to the disease parameters of average risk (k) and propor-
tion of variance explained (PVE):

$$k = \sum_{i=1}^{b} (i/b)\,p_i$$

$$k(1 - k)\,\mathrm{PVE} + k^2 = \sum_{i=1}^{b} (i/b)^2\,p_i$$

where the sum of the $p_i$'s (for i = 1, ..., b) must not exceed
1, and each $p_i$ is bounded between 0 and 1.

The parameters k, PVE, and b are predefined con- 
stants. Note that for b = 100, as well as for all the values 
of b we examined, Q is negative definite, so that this is a 
convex program. Hence, there are efficient solution 
methods to identify the global maximum. Using the 
quadprog package [32] in the R software [14], we solved 
this program for values of k and PVE with b = 100. 
When b = 1000, all maximal AUCs shown in Figure 2 
change by less than 0.01%. In fact, using b = 10 does not 
change any of these maximal AUCs more than 1% from 
that calculated with b = 1000. Note also that given an 
estimated risk distribution vector p, a researcher can dir-
ectly calculate the AUC from the objective function. To
calculate the AUC of the beta distribution for given
levels of k and PVE, we discretized the beta distribution
with parameters α = k(1/PVE - 1) and β = (1 - k)(1/PVE - 1),
which uniquely satisfy the constraints.
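
That these two beta parameters reproduce the required mean and variance can be verified with a small sketch. It uses the standard closed-form mean and variance of a beta distribution; the function names are hypothetical, and the example values are the T2D prevalence (13%) and heritability (26%) quoted in the Results.

```python
def beta_parameters(k, pve):
    """Beta shape parameters matching average risk k and variance explained pve."""
    if not (0 < k < 1 and 0 < pve < 1):
        raise ValueError("k and pve must lie strictly between 0 and 1")
    scale = 1 / pve - 1
    return k * scale, (1 - k) * scale


def beta_mean_variance(alpha, beta):
    """Closed-form mean and variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total ** 2 * (total + 1))
    return mean, variance


example_alpha, example_beta = beta_parameters(k=0.13, pve=0.26)
example_mean, example_variance = beta_mean_variance(example_alpha, example_beta)
```

The mean equals k and the variance equals k(1 - k)PVE, so the continuous distribution satisfies both constraints before it is discretized into bins.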

Sensitivity/specificity pairs 

To represent each point on the optimal ROC curve, we 
model the best sensitivity (Se) and specificity (Sp) for 
any given risk threshold t/b in terms of the risk distribu- 
tion. The logic is that the best a genetic test can do is 
identify true genetic risk, so it will declare those with a 
risk greater than the threshold as positive and those with 
a lower risk as negative. Mathematically, the sensitivity 
of the test is the probability of an individual testing posi- 
tive for the trait (i.e. having risk of at least t/b) given that 
they are truly positive: 

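$$Se = P(\text{test}+ \mid \text{truly}+) = \frac{P(\text{test}+ \,\&\, \text{truly}+)}{P(\text{truly}+)} = \frac{\sum_{i=t}^{b} (i/b)\,p_i}{k}$$

Similarly, we can derive specificity:

$$Sp = P(\text{test}- \mid \text{truly}-) = \frac{\sum_{i=0}^{t-1} (1 - i/b)\,p_i}{1 - k}$$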



We optimized sensitivity for any given value of specifi- 
city, average risk, proportion of variance explained, and 
threshold using a linear programming model. This was 
implemented in the lpSolve package in R [14] using 1000
bins. We then optimized the sensitivities over the 
thresholds to obtain the maximal sensitivity for every 
specificity, average risk, and proportion of variance 
explained. 
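
For a fixed risk distribution and threshold, the sensitivity and specificity described above can be computed as in the following sketch. It covers only the evaluation step, not the linear program that searches over distributions, and it assumes that individuals with risk of at least t/b test positive. The function name is hypothetical.

```python
def sensitivity_specificity(p, t):
    """Sensitivity and specificity when risk >= t / b is called positive.

    p: list of b + 1 probabilities summing to one; p[i] is the share of
       the population whose risk equals i / b.
    t: integer threshold index between 0 and b + 1.
    """
    b = len(p) - 1
    k = sum(i / b * p[i] for i in range(b + 1))  # average risk
    if not 0 < k < 1:
        raise ValueError("average risk must lie strictly between 0 and 1")
    # P(test+ and truly+) / P(truly+)
    sensitivity = sum(i / b * p[i] for i in range(t, b + 1)) / k
    # P(test- and truly-) / P(truly-)
    specificity = sum((1 - i / b) * p[i] for i in range(t)) / (1 - k)
    return sensitivity, specificity
```

Sweeping t from 0 to b + 1 traces the ROC curve of a single distribution; the maximal curve reported here additionally optimizes over all distributions that satisfy the prevalence and PVE constraints.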

Calculations for examples 

To calculate the proportion of T2D variance explained 
by physical activity we used Equation 1, where the risk 
distribution was defined by the prevalence and the rela- 
tive risks of exercise [33]. To calculate the heritability of 
breast cancer on the binary scale we used twice the dif- 
ference in correlation between monozygotic and dizyg- 
otic twin pairs, where correlations were computed on 
binary outcomes from 44,788 pairs of Nordic twins [34]. 
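
Both example calculations can be sketched under simplifying assumptions. The first function is the twin-based estimate described above (twice the difference between the monozygotic and dizygotic correlations, often called Falconer's formula). The second assumes a single two-level exposure with a known relative risk and applies the variance form of Equation 1 to the resulting two-group risk distribution. The function names and all numeric inputs are hypothetical; they are not the values from [33] or [34].

```python
def twin_heritability(r_mz, r_dz):
    """Heritability as twice the MZ-DZ difference in twin correlations."""
    return 2 * (r_mz - r_dz)


def pve_from_relative_risk(prevalence, exposed_fraction, relative_risk):
    """Proportion of phenotypic variance explained by a two-level exposure.

    prevalence:       average risk in the whole population.
    exposed_fraction: share of the population that is exposed.
    relative_risk:    risk in the exposed divided by risk in the unexposed.
    """
    # Choose the unexposed risk so the population average equals prevalence.
    baseline_risk = prevalence / (
        1 - exposed_fraction + exposed_fraction * relative_risk
    )
    exposed_risk = relative_risk * baseline_risk
    risk_variance = (
        exposed_fraction * (exposed_risk - prevalence) ** 2
        + (1 - exposed_fraction) * (baseline_risk - prevalence) ** 2
    )
    return risk_variance / (prevalence * (1 - prevalence))


# Hypothetical inputs, for illustration only.
example_h2 = twin_heritability(r_mz=0.3, r_dz=0.2)
example_pve = pve_from_relative_risk(
    prevalence=0.2, exposed_fraction=0.5, relative_risk=3.0
)
```

With these illustrative inputs the two groups have risks of 10% and 30%, so the exposure explains 0.01/0.16, or about 6%, of the phenotypic variance.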

Additional files 



Additional file 1: Table of maximum AUCs. These are the maximum 
AUCs corresponding to Figure 2 for all values of prevalence. Row names 
represent values of heritability (computed on the observed binary scale) 
or proportion of phenotypic variance explained, and column names 
represent values of prevalence. 

Additional file 2: Table of maximum sensitivities for each
specificity. Rows represent the combination of heritability (H.sq,
computed on the observed binary scale) and prevalence (Prev), while
columns represent specificities. The elements are the maximal sensitivity
in each case.

Additional file 3: Archive containing instructions (readme.txt) and
computer code (maxAcc.r) to implement our algorithms. The code is
written in the free statistical language and environment R
(http://www.r-project.org), relies on free R optimization packages, and is
released under the permissive MIT license
(http://www.opensource.org/licenses/mit-license.html). Updated versions are
freely available for download at:
http://code.google.com/p/max-accuracy-genetic-pred/.

Dreyfuss et al. BMC Genomics 2012, 13:340
http://www.biomedcentral.com/1471-2164/13/340